Abstract

Recently, we proposed to transform the outputs of each hidden neuron in a multi- 
layer perceptron network to have zero output and zero slope on average, and use 
separate shortcut connections to model the linear dependencies instead. We con- 
tinue the work by firstly introducing a third transformation to normalize the scale 
of the outputs of each hidden neuron, and secondly by analyzing the connections 
to second order optimization methods. We show that the transformations make 
a simple stochastic gradient behave closer to second-order optimization methods 
and thus speed up learning. This is shown both in theory and with experiments. 
The experiments on the third transformation show that while it further increases 
the speed of learning, it can also hurt performance by converging to a worse local 
optimum, where both the inputs and outputs of many hidden neurons are close to 
zero. 



1 Introduction 

Learning deep neural networks has become a popular topic since the invention of unsupervised
pretraining [4]. Some later works have returned to traditional back-propagation learning in deep
models and noticed that it can also provide impressive results [6] given either a sophisticated learning
algorithm [9] or simply enough computational power [2]. In this work we study back-propagation
learning in deep networks with up to five hidden layers, continuing on our earlier results in [10].

In learning multi-layer perceptron (MLP) networks by back-propagation, there are known
transformations that speed up learning [8, 11, 12]. For instance, inputs are recommended to be centered to
zero mean (or even whitened), and nonlinear functions are proposed to have a range from -1 to 1
rather than 0 to 1 [8]. Schraudolph [12, 11] proposed centering all factors in the gradient to have
zero mean, and further adding linear shortcut connections that bypass the nonlinear layer. Gradient
factor centering changes the gradient as if the nonlinear activation functions had zero mean and zero
slope on average. As such, it does not change the model itself. It is assumed that the discrepancy
between the model and the gradient is not an issue, since the errors will be easily compensated
by the linear shortcut connections in the subsequent updates. Gradient factor centering leads to a
significant speed-up in learning.
In this paper, we transform the nonlinear activation functions in the hidden neurons such that they
have on average 1) zero mean, 2) zero slope, and 3) unit variance. Our earlier results in [10]
included the first two transformations and here we introduce the third one. We explain the
usefulness of these transformations by studying the Fisher information matrix and the Hessian, e.g.
by measuring the angle between the traditional gradient and a second-order update direction with
and without the transformations.

It is well known that second-order optimization methods such as the natural gradient [1] or
Newton's method decrease the number of required iterations compared to the basic gradient descent,
but they cannot be easily used with high-dimensional models due to heavy computations with large
matrices. In practice, it is possible to use a diagonal or block-diagonal approximation [7] of the
Fisher information matrix or the Hessian. Gradient descent can be seen as an approximation of the
second-order methods, where the matrix is approximated by a scalar constant times a unit matrix.
Our transformations aim at making the Fisher information matrix as close to such a matrix as possible,
thus diminishing the difference between first- and second-order methods. Matlab code for replicating
the experiments in this paper is available at

https://github.com/tvatanen/ltmlp-neuralnet

2 Proposed Transformations 



Let us study an MLP network with a single hidden layer and shortcut mapping, that is, the output
column vectors $y_t$ for each sample $t$ are modeled as a function of the input column vectors $x_t$ with

$$y_t = A f(B x_t) + C x_t + \epsilon_t, \qquad (1)$$

where $f$ is a nonlinearity (such as tanh) applied to each component of the argument vector
separately, $A$, $B$, and $C$ are the weight matrices, and $\epsilon_t$ is the noise which is assumed to be zero mean
and Gaussian, that is, $p(\epsilon_{it}) = \mathcal{N}(\epsilon_{it}; 0, \sigma_i^2)$. In order to avoid separate bias vectors that complicate
formulas, the input vectors are assumed to have been supplemented with an additional component
that is always one.

Let us supplement the tanh nonlinearity with auxiliary scalar variables $\alpha_i$, $\beta_i$, and $\gamma_i$ for each
nonlinearity $f_i$. They are updated before each gradient evaluation in order to help learning of the
other parameters $A$, $B$, and $C$. We define

$$f_i(b_i x_t) = \gamma_i \left[\tanh(b_i x_t) + \alpha_i b_i x_t + \beta_i\right], \qquad (2)$$

where $b_i$ is the $i$th row vector of matrix $B$. We will ensure that

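$$\sum_{t=1}^{T} f_i(b_i x_t) = 0, \qquad (3)$$

$$\sum_{t=1}^{T} f'_i(b_i x_t) = 0, \qquad (4)$$

$$\left[\frac{1}{T}\sum_{t=1}^{T} f_i(b_i x_t)^2\right]\left[\frac{1}{T}\sum_{t=1}^{T} f'_i(b_i x_t)^2\right] = 1 \qquad (5)$$

by choosing $\alpha_i$, $\beta_i$, and $\gamma_i$ appropriately.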
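As an illustration, the three conditions described in this section (zero average output, zero average slope, and a unit product of the mean squared output and the mean squared slope) can be solved in closed form for $\alpha_i$, $\beta_i$, and $\gamma_i$. The NumPy sketch below is a derivation from those conditions, not a transcription of the authors' update formulas; the function names `transformation_params` and `transformed_tanh` are hypothetical.

```python
import numpy as np

def transformation_params(B, X):
    """Return (alpha, beta, gamma), one value per hidden unit.

    B: (hidden, inputs) weight matrix; X: (inputs, T) samples as columns.
    Derived from the three conditions in the text (an assumption of this sketch).
    """
    Z = B @ X                                   # pre-activations b_i x_t
    tanh_z = np.tanh(Z)
    slope = 1.0 - tanh_z ** 2                   # derivative of tanh
    alpha = -slope.mean(axis=1)                 # zero average slope
    beta = -(tanh_z + alpha[:, None] * Z).mean(axis=1)   # zero average output
    g = tanh_z + alpha[:, None] * Z + beta[:, None]      # centred output, unscaled
    dg = slope + alpha[:, None]                           # centred slope, unscaled
    # gamma^4 * mean(g^2) * mean(dg^2) = 1
    gamma = (np.mean(g ** 2, axis=1) * np.mean(dg ** 2, axis=1)) ** -0.25
    return alpha, beta, gamma

def transformed_tanh(Z, alpha, beta, gamma):
    """Transformed nonlinearity and its slope with respect to the pre-activation."""
    t = np.tanh(Z)
    f = gamma[:, None] * (t + alpha[:, None] * Z + beta[:, None])
    df = gamma[:, None] * (1.0 - t ** 2 + alpha[:, None])
    return f, df
```

Recomputing the three parameters on every minibatch, as the appendix describes for the MNIST experiments, would amount to calling `transformation_params` before each gradient evaluation.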
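One way to motivate the first two transformations in Equations (3) and (4) is to study the expected
output $y_t$ and its dependency on the input $x_t$:

$$\frac{1}{T}\sum_t y_t = A\left[\frac{1}{T}\sum_t f(Bx_t)\right] + C\left[\frac{1}{T}\sum_t x_t\right] \qquad (9)$$

$$\frac{1}{T}\sum_t \frac{\partial y_t}{\partial x_t} = A\left[\frac{1}{T}\sum_t \operatorname{diag}\left(f'(Bx_t)\right)\right]B + C. \qquad (10)$$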



We note that by making the nonlinear activations $f(\cdot)$ zero mean in Eq. (3), we prevent the nonlinear
mapping $Af(B\,\cdot)$ from affecting the expected output $y_t$, that is, from competing with the bias term. Similarly,
by making the nonlinear activations $f(\cdot)$ zero slope in Eq. (4), we prevent the nonlinear mapping
$Af(B\,\cdot)$ from affecting the expected dependency on the input, that is, from competing with the linear mapping
$C$. In traditional neural networks, the linear dependencies (expected $\partial y_t / \partial x_t$) are modeled by many
competing paths from an input to an output (e.g. via each hidden unit), whereas our architecture
gathers the linear dependencies to be modeled only by $C$. We argue that less competition between
parts of the model will speed up learning. Another explanation for choosing these transformations is
that they make the nondiagonal parts of the Fisher information matrix closer to zero (see Section 3).

The goal of Equation (5) is to normalize both the output signals (similarly to how data is often normalized
as a preprocessing step; see, e.g., [8]) and the slopes of the output signals of each hidden unit at
the same time. This is motivated by observing that the diagonal of the Fisher information matrix
contains elements with both the signals and their slopes. By these normalizations, we aim to make
these diagonal elements more similar to each other. As we cannot normalize both the signals and
the slopes to unity at the same time, we normalize their geometric mean to unity.

The effect of the first two transformations can be compensated exactly by updating the shortcut
mapping $C$ by

$$C_{\text{new}} = C_{\text{old}} - A(\alpha_{\text{new}} - \alpha_{\text{old}})B - A(\beta_{\text{new}} - \beta_{\text{old}})[0\ 0\ \ldots\ 1], \qquad (11)$$

where $\alpha$ is a matrix with elements $\alpha_i$ on the diagonal and one empty row below for the bias term,
and $\beta$ is a column vector with components $\beta_i$ and one zero below for the bias term. The third
transformation can be compensated by

$$A_{\text{new}} = A_{\text{old}}\,\gamma_{\text{old}}\,\gamma_{\text{new}}^{-1}, \qquad (12)$$

where $\gamma$ is a diagonal matrix with $\gamma_i$ as the diagonal elements.
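The compensation rules can be checked numerically. The sketch below assumes a single hidden layer with the constant one stored as the last input component, holds $\gamma$ at unity while $\alpha$ and $\beta$ change (the case in which the shortcut update as stated is exact), and then changes $\gamma$ separately. All function names are hypothetical.

```python
import numpy as np

def forward(A, B, C, alpha, beta, gamma, X):
    """Noise-free output A f(Bx) + Cx with the transformed tanh as f."""
    Z = B @ X
    F = gamma[:, None] * (np.tanh(Z) + alpha[:, None] * Z + beta[:, None])
    return A @ F + C @ X

def compensate_shortcut(A, B, C_old, alpha_old, alpha_new, beta_old, beta_new):
    """Shortcut update for a change in alpha and beta (gamma assumed to be one)."""
    bias_row = np.zeros((1, B.shape[1]))
    bias_row[0, -1] = 1.0                        # the row vector [0 0 ... 1]
    return (C_old
            - A @ np.diag(alpha_new - alpha_old) @ B
            - A @ (beta_new - beta_old)[:, None] @ bias_row)

def compensate_scale(A_old, gamma_old, gamma_new):
    """Output-weight update for a change in gamma: A_old gamma_old gamma_new^{-1}."""
    return A_old * (gamma_old / gamma_new)[None, :]
```

With these updates the network output stays the same while the auxiliary variables move, so only the gradient seen by the other parameters changes.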

Schraudolph [12, 11] proposed centering the factors of the gradient to zero mean. It was argued that
deviations from the gradient fall into the linear subspace that the shortcut connections operate in, so
they do not harm the overall performance. Transforming the nonlinearities as proposed in this paper
has a similar effect on the gradient. Equation (3) corresponds to Schraudolph's activity centering
and Equation (4) corresponds to slope centering.

3 Theoretical Comparison to a Second-Order Method

Second-order optimization methods, such as the natural gradient [1] or Newton's method, decrease
the number of required iterations compared to the basic gradient descent, but they cannot be easily 
used with large models due to heavy computations with large matrices. The natural gradient is 
the basic gradient multiplied from the left by the inverse of the Fisher information matrix. Using 
basic gradient descent can thus be seen as using the natural gradient while approximating the Fisher 
information with a unit matrix multiplied by the inverse learning rate. We will show how the first two 
proposed transformations move the non-diagonal elements of the Fisher information matrix closer 
to zero, and the third transformation makes the diagonal elements more similar in scale, thus making 
the basic gradient behave closer to the natural gradient. 

The Fisher information matrix contains elements

$$G_{ij} = -\sum_t \left\langle \frac{\partial^2 \log p(y_t \mid x_t, A, B, C)}{\partial\theta_i\,\partial\theta_j} \right\rangle, \qquad (13)$$

where $\langle\cdot\rangle$ is the expectation over the Gaussian distribution of noise $\epsilon_t$ in Equation (1), and vector $\theta$
contains all the elements of matrices $A$, $B$, and $C$. Note that here $y_t$ is a random variable and thus
the Fisher information does not depend on the output data. The Hessian matrix is closely related to
the Fisher information, but it does depend on the output data and contains more terms, and therefore
we show the analysis on the simpler Fisher information matrix.



The elements in the Fisher information matrix are 

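$$-\frac{\partial}{\partial a_{ij}}\frac{\partial}{\partial a_{i'j'}} \log p = \begin{cases} 0 & i' \neq i \\ \dfrac{1}{\sigma_i^2}\sum_t f_j(b_j x_t)\, f_{j'}(b_{j'} x_t) & i' = i, \end{cases} \qquad (14)$$

where $a_{ij}$ is the $ij$th element of matrix $A$, $f_j$ is the $j$th nonlinearity, and $b_j$ is the $j$th row vector of
matrix $B$. Similarly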
The cross terms are
(15)
(16) 

(17) 

(18) 
(19) 



Now we can notice that Equations (14)-(19) contain factors such as $f_j(\cdot)$, $f'_j(\cdot)$, and the components of $x_t$. We
argue that by making these factors as close to zero on average as possible, we help in making the nondiagonal
elements of the Fisher information closer to zero. For instance, $E[f_j(\cdot) f_{j'}(\cdot)] = E[f_j(\cdot)]\,E[f_{j'}(\cdot)] +
\mathrm{Cov}[f_j(\cdot), f_{j'}(\cdot)]$, so assuming that the hidden units $j$ and $j'$ are representing different things, that is,
$f_j(\cdot)$ and $f_{j'}(\cdot)$ are uncorrelated, the nondiagonal element of the Fisher information in Equation (14)
becomes exactly zero by using the transformations. When the units are not completely uncorrelated,
the element in question will be only approximately zero. The same argument applies to all other
elements in Equations (15)-(19), some of them also highlighting the benefit of making the input data
$x_t$ zero-mean. Naturally, it is unrealistic to assume that inputs $x_t$, nonlinear activations $f(\cdot)$, and
their slopes $f'(\cdot)$ are all uncorrelated, so the goodness of this approximation is empirically evaluated
in the next section.
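The decomposition of the second moment into a product of means plus a covariance can be illustrated with sample averages. The sketch below uses synthetic activations rather than a trained network, and the helper names are hypothetical; it only shows that removing the means removes the mean-product part of the off-diagonal entries.

```python
import numpy as np

def second_moment(F):
    """(1/T) sum_t f_j f_j' for activations F of shape (hidden, T)."""
    return F @ F.T / F.shape[1]

def mean_and_cov_parts(F):
    """Split the second moment into outer(mean, mean) and the (biased) covariance."""
    m = F.mean(axis=1)
    centred = F - m[:, None]
    return np.outer(m, m), centred @ centred.T / F.shape[1]

rng = np.random.default_rng(2)
# Independent units with a nonzero mean, standing in for untransformed tanh outputs.
F = np.tanh(rng.normal(loc=1.0, size=(4, 5000)))
mean_part, cov_part = mean_and_cov_parts(F)
F_centred = F - F.mean(axis=1, keepdims=True)   # effect of the zero-mean transformation
off_diagonal = ~np.eye(4, dtype=bool)
```

Here `second_moment(F)` equals `mean_part + cov_part`, while `second_moment(F_centred)` equals `cov_part` alone, so its off-diagonal entries are small whenever the units are close to uncorrelated.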



The diagonal elements of the Fisher information matrix can be found in Equations (14)-(16) when $i = i'$, $j = j'$, and
$k = k'$. There we find $f_j(\cdot)^2$ and $f'_j(\cdot)^2$ that we aim to keep similar in scale by using the third
transformation in Equation (5).



4 Empirical Comparison to a Second-Order Method 

Here we investigate how linear transformations affect the gradient by comparing it to a second-order 
method, namely Newton's algorithm with a simple regularization to make the Hessian invertible. 

Figure 1: Comparison of (a) distributions of the eigenvalues of Hessians (2600 x 2600 matrix) and
(b) angles compared to the second-order update directions using LTMLP and regular MLP. In (a),
the eigenvalues are distributed most evenly when using LTMLP. (b) shows that gradients of the
transformed networks point to the directions closer to the second-order update.

We compute an approximation of the Hessian matrix using the finite difference method, in which case
the $k$th row vector of the Hessian matrix $H$ is given by

$$h_k = \frac{\partial(\nabla E(\theta))}{\partial\theta_k} \approx \frac{\nabla E(\theta + \delta\phi_k) - \nabla E(\theta - \delta\phi_k)}{2\delta}, \qquad (20)$$

where $\phi_k = (0, 0, \ldots, 1, \ldots, 0)$ is a vector of zeros with a 1 at the $k$th position, and the error
function is $E(\theta) = -\sum_t \log p(y_t \mid x_t, \theta)$. The resulting Hessian might still contain some very small or
even negative eigenvalues which cause its inversion to blow up. Therefore we do not use the
Hessian directly, but include a regularization term similarly as in the Levenberg-Marquardt algorithm,
resulting in a second-order update direction

$$\Delta\theta = (H + \mu I)^{-1}\nabla E(\theta), \qquad (21)$$

where $I$ denotes the unit matrix. Basically, Equation (21) combines the steepest descent and the
second-order update rule in such a way that when $\mu$ gets small, the update direction approaches that of
Newton's method, and when $\mu$ gets large, it approaches the gradient direction.
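The finite-difference Hessian, the regularized update direction and the angle measurement can be sketched as follows. This is a minimal NumPy illustration under the assumption that a gradient function `grad(theta)` is available; the function names and the default step `delta` are choices of this sketch, not values from the experiments.

```python
import numpy as np

def finite_difference_hessian(grad, theta, delta=1e-5):
    """Approximate the Hessian row by row with central differences of the gradient."""
    n = theta.size
    H = np.empty((n, n))
    for k in range(n):
        phi = np.zeros(n)
        phi[k] = 1.0                              # unit vector for the kth parameter
        H[k] = (grad(theta + delta * phi) - grad(theta - delta * phi)) / (2.0 * delta)
    return H

def regularized_direction(H, g, mu):
    """Solve (H + mu I) d = g instead of forming the inverse explicitly."""
    return np.linalg.solve(H + mu * np.eye(H.shape[0]), g)

def angle_degrees(u, v):
    """Angle between two vectors in degrees."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

On a badly scaled quadratic error the gradient and the Newton direction differ markedly, whereas a large $\mu$ pulls the regularized direction back towards the gradient. Each Hessian row costs two gradient evaluations, which is why the experiment is restricted to a small network.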

Computing the Hessian is computationally demanding and therefore we have to limit the size of
the network used in the experiment. We study the MNIST handwritten digit classification problem
where the dimensionality of the input data has been reduced to 30 using PCA with a random rotation
[10]. We use a network with two hidden layers with architecture 30-25-20-10. The network was
trained using the standard gradient descent with weight decay regularization. Details of the training
are given in the appendix.

In what follows, networks with all three transformations (LTMLP, linearly transformed multi-layer
perceptron network), with two transformations (no-gamma, where all $\gamma_i$ are fixed to unity) and a
network with no transformations (regular, where we fix $\alpha_i = 0$, $\beta_i = 0$, and $\gamma_i = 1$) were compared.
The Hessian matrix was approximated according to Equation (20) 10 times at regular intervals
during the training of the networks. All figures are shown using the approximation after 4000 epochs of
training, which roughly corresponds to the midpoint of learning. However, the results were similar
to the reported ones throughout the training.

We studied the eigenvalues of the Hessian matrix (2600 x 2600) and the angles between the gradients of the
compared methods and the second-order update direction. The distribution of eigenvalues in Figure 1(a) is
more even for the networks with transformations than for the regular MLP. Furthermore, there
are fewer negative eigenvalues, which are not shown in the plot, in the transformed networks. In
Figure 1(b), the angles between the gradient and the second-order update direction are compared as
a function of $\mu$ in Equation (21). The plots are cut when $H + \mu I$ ceases to be positive definite as
$\mu$ decreases. Curiously, the update directions are closer to the second-order method when $\gamma$ is left
out, suggesting that $\gamma$s are not necessarily useful in this respect.

Figure 2 shows histograms of the diagonal elements of the Hessian after 4000 epochs of training.
All the distributions are bimodal, but the distributions are closer to unimodal when transformations
are used (subfigures (a) and (b)).¹ Furthermore, the variance of the diagonal elements in log-scale
is smaller when using LTMLP, $\sigma^2 = 0.90$, compared to the other two, $\sigma^2 = 1.71$ and $\sigma^2 =
1.43$. This suggests that when transformations are used, the second-order update rule in Equation
(21) corrects different elements of the gradient vector more evenly compared to regular
back-propagation learning, implying that the gradient vector is closer to the second-order update direction
when using all the transformations.

¹ It can also be argued whether (a) is more unimodal than (b).
Figure 2: Comparison of distributions of the diagonal elements of Hessians. Coloring according to the
legend in (c) shows which layer the corresponding weights connect to (1 = input, 4 = output). Diagonal
elements are most concentrated in LTMLP and most spread in the regular MLP network. Notice the
logarithmic x-axis.



To conclude this section, there is no clear evidence one way or another on whether the addition of $\gamma$
benefits back-propagation learning compared to using only $\alpha$ and $\beta$. However, there are some differences between
these two approaches. In any case, it seems clear that transforming the nonlinearities benefits the
learning compared to the standard back-propagation learning.



5 Experiments: MNIST Classification 

We use the proposed transformations for training MLP networks for the MNIST classification task.
Experiments are conducted without pretraining, weight-sharing, enhancements of the training set or
any other known tricks to boost the performance. No weight decay is used, and as the only regularization
we add Gaussian noise with $\sigma = 0.3$ to the training data. Networks with two and three hidden
layers with architectures 784-800-800-10 (solid lines) and 784-400-400-400-10 (dashed lines)
are used. Details are given in the appendix.

Figure 3 shows the results as the number of errors in classifying the test set of 10 000 samples. The
results of the regular back-propagation without transformations, shown in blue, are well in line with
previously published results for this task. When networks with the same architecture are trained using
the proposed transformations, the results are improved significantly. However, adding $\gamma$ in addition
to the previously proposed $\alpha$ and $\beta$ does not seem to affect results on this data set. The best result, 112
errors, is obtained by the smaller architecture without $\gamma$, and for the three-layer architecture with $\gamma$
the result is 114 errors. The learning seems to converge faster, especially in the three-layer case,
with $\gamma$. The results are in line with what was obtained in [10], where the networks were regularized more
thoroughly. These results show that it is possible to obtain results comparable to dropout networks
(see [5]) using only minimal regularization.



6 Experiments: MNIST Autoencoder 

Previously, we have studied an auto-encoder network using two transformations, $\alpha$ and $\beta$, in [10].
Now we use the same auto-encoder architecture, 784-500-250-30-250-500-784. Adding the third
transformation $\gamma$ for training the auto-encoder poses problems. Many hidden neurons in the decoding
layers (i.e., 4th and 5th hidden layers) tend to be relatively inactive in the beginning of training,
which induces the corresponding $\gamma$s to obtain very large values. In our experiments, auto-encoders with
$\gamma$s eventually diverged despite the simple constraints we experimented with, such as $\gamma_i < 100$. This
behavior is illustrated in Figure 4. Subfigure (a) shows the distribution of variances of outputs of all
hidden neurons in the MNIST classification network used in Section 5 given the MNIST training data.
The corresponding distribution for hidden neurons in the decoder part of the auto-encoder is shown
in subfigure (b). The "dead neurons" can be seen as a peak at the origin. The corresponding $\gamma$s,
constrained to $\gamma_i < 100$, can be seen in subfigure (c). We hypothesize that this behavior is due to
the fact that in the beginning of the learning there is not much information reaching the bottleneck
layer through the encoder part and thus there is nothing to learn for the decoding neurons. According
to our tentative experiments, the problem described above may be overcome by disabling $\gamma$s in the
decoder network (i.e., fixing $\gamma = 1$). However, this does not seem to speed up the learning compared
to our earlier results with only two transformations in [10]. It is also possible to experiment with
weight-sharing or other constraints to overcome the difficulties with $\gamma$s.

Figure 3: The error rate on the MNIST test set for LTMLP training, LTMLP without $\gamma$ and regular
back-propagation. The solid lines show results for networks with two hidden layers of 800 neurons
and the dashed lines for networks with three hidden layers of 400 neurons.

(a) MNIST classification (b) MNIST auto-encoder (c) MNIST auto-encoder

Figure 4: Histograms of (a-b) variances of the outputs of hidden neurons given the MNIST training data
and (c) $\gamma$s of the decoder part (4th and 5th hidden layer) in the MNIST auto-encoder. (a) shows
a healthy distribution of variances, whereas in (b), which includes only variances of the decoder
part, there are many "dead neurons". These neurons induce the corresponding $\gamma$s, a histogram of which
is shown in (c), to blow up, which eventually leads to divergence.

7 Discussion and Conclusions 

We have shown that introducing linear transformations in nonlinearities significantly improves the
back-propagation learning in (deep) MLP networks. In addition to the two transformations proposed
earlier in [10], we propose adding a third transformation in order to push the Fisher information
matrix closer to a unit matrix (apart from its scale). The hypothesis proposed in [10], that the
transformations actually mimic a second-order update rule, was confirmed by experiments comparing the
networks with transformations and the regular MLP network to a second-order update method.
However, in order to find out whether the third transformation, $\gamma$, proposed in this paper, is really
useful, more experiments ought to be conducted. It might be useful to design experiments where
convergence is usually very slow, thus revealing possible differences between the methods. As
hyperparameter selection and regularization are usually a nuisance in practical use of neural networks,
it would be interesting to see whether combining dropout [5] and our transformations can provide
a robust framework enabling the training of neural networks in reasonable time.
The effect of the first two transformations is very similar to gradient factor centering [12, 11], but
transforming the model instead of the gradient makes it easier to generalize to other contexts: when
learning by MCMC, variational Bayes, or by genetic algorithms, one would not compute the
basic gradient at all. For instance, consider using the Metropolis algorithm on the weight matrices,
and especially matrices $A$ and $B$. Without transformations, the proposed jumps would affect the
expected output $y_t$ and the expected linear dependency $\partial y_t / \partial x_t$ in Eqs. (9)-(10), thus often leading
to low acceptance probability and poor mixing. With the proposed transformations included, longer
proposed jumps in $A$ and $B$ could be accepted, thus mixing the nonlinear part of the mapping faster.
For further discussion, see [10], Section 6. The implications of the proposed transformations in
these other contexts are left as future work.
Appendix
Details of Section 4

In the experiments of Section 4, networks with all three transformations (LTMLP), only $\alpha$ and $\beta$
(no-gamma) and a network with no transformations (regular) were compared. Full batch training without
momentum was used to make things as simple as possible. The networks were regularized using
weight decay and adding Gaussian noise to the training data. Three hyperparameters, the weight
decay term, the input noise variance and the learning rate, were validated for all networks separately. The
input data was normalized to zero mean and the network was initialized as proposed in [3], that is,
the weights were drawn from a uniform distribution between $\pm\sqrt{6}/\sqrt{n_j + n_{j+1}}$, where $n_j$ is the
number of neurons on the $j$th layer.
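The initialization can be sketched in a few lines of NumPy. The function name `init_weights` and the (outputs, inputs) weight layout are assumptions of this sketch; shortcut weights and bias handling are omitted.

```python
import numpy as np

def init_weights(layer_sizes, rng):
    """Draw each weight matrix uniformly from +-sqrt(6)/sqrt(n_j + n_{j+1})."""
    weights = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        weights.append(rng.uniform(-limit, limit, size=(n_out, n_in)))
    return weights

# Layer sizes of the small network used in the comparison to the second-order method.
weights = init_weights([30, 25, 20, 10], np.random.default_rng(0))
```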
Figure 5: Comparison of (a) training and (b) test errors of the algorithms using the MNIST data
in the experiment comparing them to the second-order method. Note how the best learning rate for the
regular MLP is relatively high, leading to oscillations until it is annealed towards the end.



We sampled the three hyperparameters randomly (given our best guess intervals) for 500 runs and
selected the median of the runs that resulted in the best 50 validation errors as the hyperparameters.
The resulting hyperparameters are listed in Table 1. Notable differences occur in step sizes, as it seems
that networks with transformations allow using a significantly larger step size, which in turn results in
a more complete search in the weight space.
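The selection rule (the median over the runs with the best validation errors) can be written compactly. In the sketch below the array layout, one row of hyperparameters per run, and the function name are assumptions; a per-hyperparameter median is one reading of the description.

```python
import numpy as np

def select_hyperparameters(samples, validation_errors, n_best=50):
    """Median, per hyperparameter, over the n_best runs with lowest validation error.

    samples: (runs, n_hyperparameters); validation_errors: (runs,).
    """
    best = np.argsort(validation_errors)[:n_best]
    return np.median(samples[best], axis=0)
```

Taking a median over the best runs, rather than the single best run, makes the choice less sensitive to one lucky random draw.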

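Our weight updates are given by

$$\theta_\tau = \theta_{\tau-1} - \epsilon_\tau \nabla\theta_\tau, \qquad (22)$$

where $\nabla\theta_\tau$ denotes the gradient of the error with respect to the weights and the learning rate on
iteration $\tau$, $\epsilon_\tau$, is given by

$$\epsilon_\tau = \begin{cases} \epsilon_0 & \tau \le T/2 \\ 2\left(1 - \dfrac{\tau}{T}\right)\epsilon_0 & \tau > T/2, \end{cases} \qquad (23)$$

that is, the learning rate starts decreasing linearly after the midpoint of the given training time $T$.
Furthermore, the learning rate $\epsilon_\tau$ is dampened for shortcut connection weights by multiplying with
$(1/2)^s$, where $s$ is the number of skipped layers as proposed in [10].² Figure 5 shows training and test
errors for the networks. The LTMLP obtains the best results although there is no big difference
compared to training without $\gamma$.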
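The schedule described here (constant until the midpoint, then linear decay to zero at the end of training) and the shortcut damping can be sketched as follows. The function names are hypothetical, and the damping factor is assumed to be one half raised to the number of skipped layers.

```python
def learning_rate(tau, T, eps0):
    """Constant eps0 for the first half of training, then linear decay to zero at tau = T."""
    if tau <= T / 2:
        return eps0
    return 2.0 * (1.0 - tau / T) * eps0

def shortcut_learning_rate(eps, skipped_layers):
    """Damp the learning rate of a shortcut that skips the given number of layers."""
    return eps * 0.5 ** skipped_layers
```

For example, with `T = 100` and `eps0 = 1.2` the rate stays at 1.2 up to iteration 50, is 0.6 at iteration 75, and reaches zero at iteration 100.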

Details of Section 5

The MNIST dataset consists of 28 x 28 images of hand-drawn digits. There are 60 000 training
samples and 10 000 test samples. We experimented with two networks with two and three hidden
layers, with the numbers of hidden neurons chosen arbitrarily. Training was done in minibatch mode with
1000 samples in each batch, and the transformations were updated on every iteration using the current
minibatch with Equations (6)-(8). This seems to speed up learning compared to the approach in [10],
where transformations were updated only occasionally with the full training data. Random Gaussian
noise with $\sigma = 0.3$ was injected to the training data in the beginning of each epoch.



² This heuristic is not well supported by analysis of Figure 2 and could be re-examined.



Table 1: Hyperparameters for the neural networks

                LTMLP          no-gamma       regular
weight decay    4.6 x 10^-b    1.3 x 10^-b    3.9 x 10^-b
noise           0.31           0.36           0.29
step size       1.2            2.5            0.45
Our weight update equations are given by:

$$\Delta\theta_\tau = \nabla\theta_\tau + p_\tau \Delta\theta_{\tau-1}, \qquad (24)$$

$$\theta_\tau = \theta_{\tau-1} - \epsilon_\tau \Delta\theta_\tau, \qquad (25)$$

where

$$p_\tau = \begin{cases} \dfrac{\tau}{T}\, p_f + \left(1 - \dfrac{\tau}{T}\right) p_0 & \tau < T \\ p_f & \tau \ge T, \end{cases} \qquad (26)$$

$$\epsilon_\tau = \begin{cases} \epsilon_0 & \tau < T \\ \epsilon_0 f^{\,\tau - T} & \tau \ge T. \end{cases} \qquad (27)$$

In the equations above, $T$ is a "burn-in time" during which the momentum $p_\tau$ is increased from the starting value
$p_0 = 0.5$ to $p_f = 0.9$ and the learning rate $\epsilon_\tau = \epsilon_0$ is kept constant. When $\tau \ge T$, the momentum is kept
constant and the learning rate starts decreasing exponentially with $f = 0.9$. Hyperparameters were not
validated but chosen by arbitrary guess such that learning did not diverge. For the regular training,
$\epsilon_0 = 0.05$ was selected since it diverged with higher learning rates. Then according to lessons
learned, e.g. in Section 4, $\epsilon_0 = 0.3$ was set for LTMLP with $\gamma$ and $\epsilon_0 = 0.7$ for the variant with no
$\gamma$. Basically, it seems that transformations allow using higher learning rates and thus enable faster
convergence.
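The momentum and learning-rate schedules with a burn-in period can be sketched as below. The function names are hypothetical, and whether the counter $\tau$ runs over epochs or minibatch iterations is not stated here, so the sketch leaves it to the caller.

```python
def momentum(tau, T, p0=0.5, pf=0.9):
    """Linear increase from p0 to pf during the burn-in time T, constant afterwards."""
    if tau < T:
        return (tau / T) * pf + (1.0 - tau / T) * p0
    return pf

def decayed_learning_rate(tau, T, eps0, f=0.9):
    """Constant eps0 during burn-in, then exponential decay by a factor f per step."""
    if tau < T:
        return eps0
    return eps0 * f ** (tau - T)

def momentum_step(theta, grad, velocity, tau, T, eps0):
    """One update: accumulate the gradient with momentum, then step against it."""
    velocity = grad + momentum(tau, T) * velocity
    theta = theta - decayed_learning_rate(tau, T, eps0) * velocity
    return theta, velocity
```

The same functions work on floats or NumPy arrays, since only elementwise arithmetic is used.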